Corpus-based Name Standardization
Abstract
Variation in the spelling of names has various origins, many of which are difficult to describe by rule. We present a method that combines rules with a probabilistic similarity measure and that can make use of existing onomastic corpora. Rules first convert an unknown name to a semiphonemic form. Possible candidates are then selected from the onomastic corpus. For each candidate in this set, the similarity to the unknown name is computed, and a decision procedure chooses the best candidate. If no specific onomastic corpus is available, the method provides a tool for clustering similar names. The method is demonstrated on a corpus of 49,193 first names from 18th-century parish registers, given the availability of a Dutch corpus with 22,579 variants of 4,482 base forms of first names.
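The pipeline described in the abstract (rule-based conversion, candidate selection, similarity scoring, decision) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method from the paper: the rewrite rules in `RULES`, the candidate filter, the threshold of 0.8, and the toy corpus in the usage note are all hypothetical, and Python's `difflib.SequenceMatcher` ratio stands in for the probabilistic similarity measure, which the abstract does not specify.

```python
from difflib import SequenceMatcher

# Hypothetical rewrite rules (NOT the rules from the paper), applied in order.
RULES = [("ck", "k"), ("ph", "f"), ("th", "t"), ("y", "i"), ("c", "k"), ("z", "s")]


def semiphonemic(name):
    """Convert a name to a rough 'semiphonemic' form using the toy rules."""
    form = name.lower()
    for old, new in RULES:
        form = form.replace(old, new)
    # Collapse runs of the same letter ("tt" -> "t").
    out = []
    for ch in form:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def candidates(form, corpus_forms):
    """Assumed candidate filter: same first letter, length within 2."""
    return [
        c for c in corpus_forms
        if c and c[0] == form[0] and abs(len(c) - len(form)) <= 2
    ]


def standardize(name, corpus, threshold=0.8):
    """Return the base form of the best-matching variant, or None.

    corpus maps each known variant to its base form.
    """
    form = semiphonemic(name)
    if not form:
        return None
    # Index the corpus by the semiphonemic form of each variant.
    index = {}
    for variant, base in corpus.items():
        index.setdefault(semiphonemic(variant), base)
    best, best_score = None, 0.0
    for cand in candidates(form, index):
        # Stand-in similarity; the paper uses a probabilistic measure.
        score = SequenceMatcher(None, form, cand).ratio()
        if score > best_score:
            best, best_score = index[cand], score
    # Decision step: accept the best candidate only above the threshold.
    return best if best_score >= threshold else None
```

With a toy corpus such as `{"Catharina": "Catharina", "Katrijn": "Catharina", "Jacobus": "Jacobus", "Jacob": "Jacobus"}`, calling `standardize("Kattarina", corpus)` maps the unknown spelling to the base form `"Catharina"`, because both spellings reduce to the same semiphonemic form under these assumed rules.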
Similar resources
Globalization, Standardization, and Dialect Leveling in Iran
This paper is an attempt to shed light on the effects of modernization, urbanization, monolingual educational system, and mass media as well as the process of globalization on dialect leveling among Persian dialects. In so doing, the first part of the paper elaborates on the relationship between globalization and sociolinguistics, and on the concept of standardization. Also, it discusses some ...
Coreference resolution with deep learning in the Persian Language
Coreference resolution is an advanced issue in natural language processing. Nowadays, due to the extension of social networks, TV channels, news agencies, the Internet, etc. in human life, reading all the contents, analyzing them, and finding a relation between them require time and cost. In the present era, text analysis is performed using various natural language processing techniques, one ...
Content of Linguistic Annotation: Standards and Practices (CLASP) Research Activities and Findings
25 members of the computational linguistics research community participated in a meeting at New York University on November 7, 2009 to address several difficult questions about the standardization of linguistic content in corpus annotation, where we define the term standardization to include all efforts to improve compatibility or interoperability between annotation content, including not only ...
Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary
In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our met...
Services integration, professional autonomy and standardization. Representations of standardization among case managers in the case of integrated service networks implementations
Purpose: Social work (SW) practices are undergoing major transformations generated by change in the governance of health and social policies. These transformations are based on two logics of performance, one managerial, resting on New Public Management principles, and the other clinical, supported by evidence-based practices. The implementation of integrated services is traversed by these two logics ...
Journal title: History and Computing
Volume 6
Publication date: 1994